174 research outputs found
InDL: A New Dataset and Benchmark for In-Diagram Logic Interpreting based on Visual Illusion
This paper introduces a novel approach to evaluating deep learning models'
capacity for in-diagram logic interpretation. Leveraging the intriguing realm
of visual illusions, we establish a unique dataset, InDL, designed to
rigorously test and benchmark these models. Deep learning has witnessed
remarkable progress in domains such as computer vision and natural language
processing. However, models often stumble in tasks requiring logical reasoning
due to their inherent 'black box' characteristics, which obscure the
decision-making process. Our work presents a new lens to understand these
models better by focusing on their handling of visual illusions -- a complex
interplay of perception and logic. We utilize six classic geometric optical
illusions to create a comparative framework between human and machine visual
perception. This methodology offers a quantifiable measure to rank models,
elucidating potential weaknesses and providing actionable insights for model
improvements. Our experimental results affirm the efficacy of our benchmarking
strategy, demonstrating its ability to effectively rank models based on their
logic interpretation ability. As part of our commitment to reproducible
research, the source code and datasets will be made publicly available at:
https://github.com/rabbit-magic-wh/InDL
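The ranking idea described above can be sketched in a few lines. All accuracy numbers and model names below are hypothetical, purely for illustration; the paper's actual metric and models may differ.

```python
import numpy as np

# Hypothetical per-illusion accuracies for three models across the six
# classic geometric optical illusions in the dataset (numbers invented).
accuracies = {
    "model_a": [0.91, 0.72, 0.88, 0.65, 0.79, 0.83],
    "model_b": [0.85, 0.80, 0.74, 0.70, 0.81, 0.77],
    "model_c": [0.60, 0.58, 0.66, 0.55, 0.62, 0.59],
}

# One quantifiable ranking: mean accuracy across illusions, highest first.
ranking = sorted(accuracies, key=lambda m: np.mean(accuracies[m]), reverse=True)
print(ranking)
```

A single scalar per model makes the comparison against human performance straightforward, though per-illusion breakdowns are what expose specific weaknesses.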
SE(3) Diffusion Model-based Point Cloud Registration for Robust 6D Object Pose Estimation
In this paper, we introduce an SE(3) diffusion model-based point cloud
registration framework for 6D object pose estimation in real-world scenarios.
Our approach formulates the 3D registration task as a denoising diffusion
process, which progressively refines the pose of the source point cloud to
obtain a precise alignment with the model point cloud. Training our framework
involves two operations: An SE(3) diffusion process and an SE(3) reverse
process. The SE(3) diffusion process gradually perturbs the optimal rigid
transformation of a pair of point clouds by continuously injecting noise
(perturbation transformation). By contrast, the SE(3) reverse process focuses
on learning a denoising network that refines the noisy transformation
step-by-step, bringing it closer to the optimal transformation for accurate
pose estimation. Unlike standard diffusion models used in linear Euclidean
spaces, our diffusion model operates on the SE(3) manifold. This requires
exploiting the linear Lie algebra se(3) associated with SE(3) to
constrain the transformation transitions during the diffusion and reverse
processes. Additionally, to effectively train our denoising network, we derive
a registration-specific variational lower bound as the optimization objective
for model learning. Furthermore, we show that our denoising network can be
constructed with a surrogate registration model, making our approach applicable
to different deep registration networks. Extensive experiments demonstrate that
our diffusion registration framework achieves outstanding pose estimation
performance on the real-world TUD-L, LINEMOD, and Occluded-LINEMOD datasets.
Comment: Accepted by NeurIPS-202
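The forward diffusion step on the manifold can be sketched as follows, assuming the common Lie-algebra construction: noise is drawn in se(3) and pushed onto SE(3) through the exponential map. The paper's actual noise schedule and parameterization may differ; this is a minimal illustration.

```python
import numpy as np
from scipy.linalg import expm

def hat(xi):
    """Map a 6-vector xi = (omega, v) in se(3) to its 4x4 matrix form."""
    wx, wy, wz, vx, vy, vz = xi
    return np.array([
        [0.0, -wz,  wy, vx],
        [ wz, 0.0, -wx, vy],
        [-wy,  wx, 0.0, vz],
        [0.0, 0.0, 0.0, 0.0],
    ])

def perturb_pose(T, sigma, rng):
    """One diffusion step: left-multiply T by exp(hat(eps)), eps ~ N(0, sigma^2 I_6)."""
    eps = rng.normal(scale=sigma, size=6)
    return expm(hat(eps)) @ T

rng = np.random.default_rng(0)
T = np.eye(4)  # optimal transformation (identity here for illustration)
for _ in range(10):  # forward process: noise accumulates on the manifold
    T = perturb_pose(T, sigma=0.05, rng=rng)
```

Because the perturbation is applied through the exponential map, every intermediate `T` remains a valid rigid transformation, which is exactly what a Euclidean diffusion model cannot guarantee.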
A Multi-scale Learning of Data-driven and Anatomically Constrained Image Registration for Adult and Fetal Echo Images
Temporal echo image registration is a basis for clinical quantifications such
as cardiac motion estimation, myocardial strain assessments, and stroke volume
quantifications. Deep learning image registration (DLIR) is consistently
accurate, requires less computing effort, and has shown encouraging results in
earlier applications. However, we propose that a greater focus on the warped
moving image's anatomic plausibility and image quality can support robust DLIR
performance. Further, past implementations have focused on adult echo, and
there is an absence of DLIR implementations for fetal echo. We propose a
framework combining three strategies for DLIR for both fetal and adult echo:
(1) an anatomic shape-encoded loss to preserve physiological myocardial and
left ventricular anatomical topologies in warped images; (2) a data-driven loss
that is trained adversarially to preserve good image texture features in warped
images; and (3) a multi-scale training scheme of a data-driven and anatomically
constrained algorithm to improve accuracy. Our experiments show that the
shape-encoded loss and the data-driven adversarial loss are strongly correlated
to good anatomical topology and image textures, respectively. They improve
different aspects of registration performance in a non-overlapping way,
justifying their combination. We show that these strategies can provide
excellent registration results in both adult and fetal echo using the publicly
available CAMUS adult echo dataset and our private multi-demographic fetal echo
dataset, despite fundamental distinctions between adult and fetal echo images.
Our approach also outperforms traditional non-DL gold standard registration
approaches, including Optical Flow and Elastix. Registration improvements could
also be translated into more accurate and precise clinical quantification of
cardiac ejection fraction, demonstrating potential for clinical translation.
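The multi-scale training idea can be sketched as below. Heavy hedging applies: the framework's shape-encoded and adversarial losses are learned networks, so a simple multi-scale mean-squared similarity stands in for them here, and the scale factors and weights are invented.

```python
import numpy as np

def downsample(img, factor):
    """Average-pool a 2D image by an integer factor (a simple multi-scale proxy)."""
    h, w = img.shape
    return img[:h - h % factor, :w - w % factor] \
        .reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def multiscale_similarity_loss(fixed, warped, factors=(1, 2, 4), weights=(1.0, 0.5, 0.25)):
    """Weighted sum of mean-squared differences across scales; coarse scales
    guide global alignment while fine scales refine local detail."""
    total = 0.0
    for f, w in zip(factors, weights):
        a, b = downsample(fixed, f), downsample(warped, f)
        total += w * float(np.mean((a - b) ** 2))
    return total

rng = np.random.default_rng(1)
fixed = rng.random((64, 64))
loss_same = multiscale_similarity_loss(fixed, fixed)       # perfectly aligned
loss_diff = multiscale_similarity_loss(fixed, rng.random((64, 64)))
```

In the full framework, each scale's loss would also include the anatomic shape-encoded and adversarial terms, combined with weights tuned per scale.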
Policy Space Diversity for Non-Transitive Games
Policy-Space Response Oracles (PSRO) is an influential algorithm framework
for approximating a Nash Equilibrium (NE) in multi-agent non-transitive games.
Many previous studies have tried to promote policy diversity in PSRO. A
major weakness of existing diversity metrics is that a more diverse population
(according to those metrics) does not necessarily imply (as we prove in this
paper) a better approximation to an NE. To alleviate this problem, we
propose a new diversity metric, the improvement of which guarantees a better
approximation to an NE. Meanwhile, we develop a practical and well-justified
method to optimize our diversity metric using only state-action samples. By
incorporating our diversity regularization into the best response solving in
PSRO, we obtain a new PSRO variant, Policy Space Diversity PSRO (PSD-PSRO). We
present the convergence property of PSD-PSRO. Empirically, extensive
experiments on various games demonstrate that PSD-PSRO is more effective in
producing significantly less exploitable policies than state-of-the-art PSRO
variants.
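The overall PSRO loop the variant builds on can be sketched on rock-paper-scissors, the textbook non-transitive game. The diversity bonus below is a stand-in, not the paper's metric, and the meta-solver is plain fictitious play; everything here is illustrative.

```python
import numpy as np

# Rock-paper-scissors payoff for the row player (a classic non-transitive game).
A = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]], dtype=float)

def meta_nash(payoffs, iters=2000):
    """Approximate the symmetric meta-game NE with fictitious play."""
    counts = np.ones(payoffs.shape[0])
    for _ in range(iters):
        sigma = counts / counts.sum()
        counts[np.argmax(payoffs @ sigma)] += 1
    return counts / counts.sum()

def best_response(population, sigma, lam=0.0):
    """Pure-strategy BR to the population mixture, plus a hypothetical
    diversity bonus lam on actions the mixture rarely plays (illustrative,
    not the paper's regularizer)."""
    opponent = sum(s * p for s, p in zip(sigma, population))
    values = A @ opponent + lam * (1.0 - opponent)
    br = np.zeros(3)
    br[np.argmax(values)] = 1.0
    return br

population = [np.array([1.0, 0.0, 0.0])]  # start with pure Rock
for _ in range(5):  # PSRO iterations
    payoffs = np.array([[p @ A @ q for q in population] for p in population])
    sigma = meta_nash(payoffs)
    population.append(best_response(population, sigma, lam=0.1))

final_sigma = meta_nash(np.array([[p @ A @ q for q in population] for p in population]))
mixture = sum(s * p for s, p in zip(final_sigma, population))
exploitability = float(np.max(A @ mixture))  # BR value against the final mixture
```

Exploitability of the final mixture is the quantity PSRO variants are ultimately judged on; here it shrinks as the population fills in the strategy cycle.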
Maximum Entropy Heterogeneous-Agent Mirror Learning
Multi-agent reinforcement learning (MARL) has been shown effective for
cooperative games in recent years. However, existing state-of-the-art methods
face challenges related to sample inefficiency, brittleness regarding
hyperparameters, and the risk of converging to a suboptimal Nash Equilibrium.
To resolve these issues, in this paper, we propose a novel theoretical
framework, named Maximum Entropy Heterogeneous-Agent Mirror Learning (MEHAML),
that leverages the maximum entropy principle to design maximum entropy MARL
actor-critic algorithms. We prove that algorithms derived from the MEHAML
framework enjoy the desired properties of the monotonic improvement of the
joint maximum entropy objective and the convergence to quantal response
equilibrium (QRE). The practicality of MEHAML is demonstrated by developing a
MEHAML extension of the widely used soft actor-critic (SAC) algorithm, dubbed
HASAC, which shows significant improvements in exploration and
robustness on three challenging benchmarks: Multi-Agent MuJoCo, StarCraft II,
and Google Research Football. Our results show that HASAC outperforms strong
baseline methods such as HATD3, HAPPO, QMIX, and MAPPO, thereby establishing
the new state of the art. See our project page at
https://sites.google.com/view/mehaml
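The maximum entropy principle underlying the framework can be illustrated in the single-agent tabular case (a sketch of the general soft-RL idea, not the MEHAML algorithm itself): the log-sum-exp soft value replaces the hard max, and the Boltzmann policy maximises expected return plus alpha times entropy.

```python
import numpy as np

def soft_value(q_values, alpha):
    """Soft value V = alpha * log sum_a exp(Q(s,a)/alpha),
    the maximum-entropy analogue of max_a Q(s,a)."""
    return alpha * np.log(np.sum(np.exp(q_values / alpha)))

def soft_policy(q_values, alpha):
    """Boltzmann policy pi(a|s) proportional to exp(Q(s,a)/alpha), which
    maximises E[Q] + alpha * H(pi)."""
    z = np.exp((q_values - q_values.max()) / alpha)
    return z / z.sum()

q = np.array([1.0, 2.0, 0.5])
pi_soft = soft_policy(q, alpha=1.0)    # high alpha: near-uniform, exploratory
pi_sharp = soft_policy(q, alpha=0.01)  # low alpha: near-greedy
```

Higher temperature keeps exploration alive, which is the mechanism behind the improved exploration and robustness reported in the abstract.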
Controllable Textual Inversion for Personalized Text-to-Image Generation
Recent large-scale generative models have attained unprecedented
performance, especially in producing high-fidelity images driven by text
prompts. Textual inversion (TI), alongside the text-to-image model backbones, is
proposed as an effective technique in personalizing the generation when the
prompts contain user-defined, unseen, or long-tail concept tokens. Despite this,
we find and show that deploying TI remains full of "dark magic" -- to name a
few issues, a harsh requirement for additional datasets, arduous human effort
in the loop, and a lack of robustness. In this work, we propose a much-enhanced
version of TI, dubbed Controllable Textual Inversion (COTI), which resolves all
the aforementioned problems and in turn delivers a robust, data-efficient, and
easy-to-use framework. The core of COTI is a theoretically guided loss
objective instantiated with a comprehensive and novel weighted scoring
mechanism, encapsulated by an active-learning paradigm. The extensive results
show that COTI significantly outperforms prior TI-related approaches, with a
26.05 decrease in FID score and a 23.00% boost in R-precision.
Comment: 10 pages, 6 figures, 2 tables. Project Page:
https://github.com/jnzju/COT
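The active-learning selection loop around a weighted scorer can be sketched as follows. The signals and weights below are invented for illustration; the paper's scorer is learned, not hand-set like this.

```python
import numpy as np

def weighted_score(aesthetic, relevance, diversity, w=(0.4, 0.4, 0.2)):
    """Hypothetical weighted combination of per-image signals into one score."""
    return w[0] * aesthetic + w[1] * relevance + w[2] * diversity

def select_batch(pool_scores, k):
    """Active-learning step: pick the top-k scoring candidates from the pool."""
    return np.argsort(pool_scores)[::-1][:k]

rng = np.random.default_rng(2)
pool = rng.random((100, 3))  # columns: aesthetic, relevance, diversity
scores = weighted_score(pool[:, 0], pool[:, 1], pool[:, 2])
chosen = select_batch(scores, k=8)
```

Each round, the selected batch would be added to the TI training set and the scorer updated, removing the need for a human to curate images manually.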
Multi-Task Learning with Multi-Query Transformer for Dense Prediction
Previous multi-task dense prediction studies developed complex pipelines such
as multi-modal distillations in multiple stages or searching for task
relational contexts for each task. The core insight behind these methods is to
maximize the mutual effects between each task. Inspired by the recent
query-based Transformers, we propose a simpler pipeline named Multi-Query
Transformer (MQTransformer) that is equipped with multiple queries from
different tasks to facilitate the reasoning among multiple tasks and simplify
the cross-task pipeline. Instead of modeling the dense per-pixel context among
different tasks, we seek a task-specific proxy to perform cross-task reasoning
via multiple queries where each query encodes the task-related context. The
MQTransformer is composed of three key components: a shared encoder, cross-task
attention, and a shared decoder. We first model each task with a task-relevant and
scale-aware query, and then both the image feature output by the feature
extractor and the task-relevant query feature are fed into the shared encoder,
thus encoding the query feature from the image feature. Second, we design a
cross-task attention module to reason about the dependencies among multiple
tasks and feature scales from two perspectives: different tasks at the same
scale and different scales of the same task. Then we use a shared decoder to
gradually refine the image features with the reasoned query features from
different tasks. Extensive experimental results on two dense prediction
datasets (NYUD-v2 and PASCAL-Context) show that the proposed method is
effective and achieves state-of-the-art results.
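The two-stage query mechanism can be sketched with plain scaled-dot-product attention. This is a minimal stand-in, assuming one query vector per task and flattened image features; the actual module's projections, heads, and scale handling are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_task_attention(task_queries, image_feats):
    """Each task query attends over the image features (a per-task proxy for
    dense per-pixel context), then the updated queries attend to each other
    so tasks can exchange context."""
    d = task_queries.shape[-1]
    # queries -> image features
    attn = softmax(task_queries @ image_feats.T / np.sqrt(d))
    task_ctx = attn @ image_feats
    # task -> task reasoning among the updated queries
    attn2 = softmax(task_ctx @ task_ctx.T / np.sqrt(d))
    return attn2 @ task_ctx

T, N, d = 3, 64, 32  # 3 tasks, 64 spatial positions, feature dim 32
rng = np.random.default_rng(3)
out = cross_task_attention(rng.normal(size=(T, d)), rng.normal(size=(N, d)))
```

One vector per task keeps the cross-task step O(T^2) rather than quadratic in the number of pixels, which is the efficiency argument for the proxy.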